Explore Red Wine Data by Victor Lambert

This dataset contains 1599 records of red wine chemical composition and quality rating. The wine is the red Vinho Verde from northern Portugal. There are 12 variables in the dataset: 11 chemical properties of the wine and a quality rating as reported by three different critics.

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Questions we will investigate during this exploration:
- What are the variables that impact wine quality?
- Can we use the data to predict wine quality?
- Are there interesting correlations between other measurements?

Univariate Plots Section

In this section we look into the distribution of single variables.

Quality Distribution

## [1] "Quality Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

It is clear from the max and min that there aren’t any wines above 8 or below 3. Also, since the first and third quartiles are 5 and 6 respectively, we expect the majority of the values to be 5 or 6. Since quality is discrete, a table of all the values could be useful.

## [1] "Quality Table"
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Here it is clear that we should combine some bins so that the sparse extremes contain enough wines to compare. We create four bins for wine quality:
- poor wines 3 and 4
- low-mediocre wines 5
- high-mediocre wines 6
- good wines 7 and 8

These bins can be used to split other variables and see if there are patterns.
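
The binning can be sketched in a few lines. This is a Python illustration only, not the code used to produce this report; the names `quality_counts` and `bin_label` are introduced here, and the counts are copied from the quality table.

```python
from collections import Counter

# Counts per quality score, copied from the "Quality Table" output.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

def bin_label(score):
    """Map a quality score to one of four right-closed intervals."""
    if score <= 4:
        return "(2,4]"
    if score == 5:
        return "(4,5]"
    if score == 6:
        return "(5,6]"
    return "(6,8]"

# Sum the per-score counts into the four bins.
binned = Counter()
for score, count in quality_counts.items():
    binned[bin_label(score)] += count

for label in ["(2,4]", "(4,5]", "(5,6]", "(6,8]"]:
    print(label, binned[label])
```

The interval labels follow the right-closed notation used in the binned table, so a score of 4 lands in (2,4] and a score of 5 in (4,5].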

## [1] "Quality Bins Table"
## 
## (2,4] (4,5] (5,6] (6,8] 
##    63   681   638   217

As we expected, most of the values are either 5 or 6 for quality. There are quite a few 7 wines also which might be helpful for identifying trends. We will look at the distribution in a histogram next.

Initially, looking at the distribution of quality we see something close to a normal distribution. The vast majority of the wines are in the middle of the spectrum, and there are not many wines at the extremes. We will try to find any variables which can help us predict this.

We will go through the available variables, paying closer attention to the ones that look most interesting.

Acidity Metrics

Looking at the three different types of acid, the fixed and volatile acidity are clearly close to a normal distribution. The citric acid, however, doesn’t look like a normal distribution. We will look at the log of the citric acid content.
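
A log transform of this kind can be sketched as follows. This is a Python illustration with made-up readings, not values from the dataset. It assumes some readings may be exactly zero and drops them, because the logarithm of zero is undefined; the helper name `log10_positive` is hypothetical.

```python
import math

def log10_positive(values):
    """Return log10 of each strictly positive value; zeros are dropped because log10(0) is undefined."""
    return [math.log10(v) for v in values if v > 0]

# Hypothetical citric acid readings in g/dm^3 (not taken from the dataset).
citric = [0.0, 0.01, 0.1, 0.25, 0.5, 1.0]
logged = log10_positive(citric)
print([round(x, 2) for x in logged])
```

Dropping zeros changes the sample, so the number of removed values is worth reporting alongside any log-scale plot.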

The citric acid content does seem more normal when plotted on a log scale, but it is still not perfect since it is skewed by some low values. For the last acidity-related metric we will look at pH.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH follows a very normal distribution with almost all of the values between three and four. This likely correlates well with the acidity variables; we will look into this more in the bivariate and multivariate analysis sections.
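
Because pH is a logarithmic scale, a narrow pH range still covers a wide range of acidity. The Python sketch below uses the textbook relation pH = -log10([H+]), treating activity as concentration (an approximation), and plugs in the minimum, median, and maximum from the summary above. The function name is introduced here for illustration.

```python
def hydrogen_ion_concentration(ph):
    """Convert pH to an approximate hydrogen ion concentration in mol/L."""
    return 10.0 ** (-ph)

# Minimum, median, and maximum pH from the summary output.
for ph in (2.74, 3.31, 4.01):
    print(ph, f"{hydrogen_ion_concentration(ph):.2e} mol/L")

# Ratio of [H+] between the lowest-pH and highest-pH wines.
ratio = hydrogen_ion_concentration(2.74) / hydrogen_ion_concentration(4.01)
print(round(ratio, 1))
```

Under this approximation, the most acidic wine has more than ten times the hydrogen ion concentration of the least acidic one, even though both sit within about 1.3 pH units of each other.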

Inorganic Molecule Analysis

Next we move on to analyses of the inorganic compounds reported in the dataset. First, we investigate the chloride content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

We see that the majority of chloride concentrations lie between zero and 0.2, but there are some values as high as 0.6. The median value is about 0.08 g/dm^3. We should zoom in on the area with most of the values to get a better idea of the distribution.

The distribution of chlorides looks much more normal in that region. It seems to be a normal distribution with a long tail. We will have to check if the long tail values influence the quality in the two variable analysis later. Now we turn to the sulfur dioxide, both free and total.

The data description states that free SO2 is the type which would impact the flavor of the wine, particularly if it is too high. That threshold will have to be investigated later in the analysis. Another interesting thing to consider is the proportion of free over total sulfur dioxide in each wine. Perhaps that value will be similar in all wines.
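
The proportion is a simple ratio of the two sulfur dioxide columns. A minimal Python sketch follows, using hypothetical (free, total) pairs rather than rows from the dataset; the function name `free_so2_proportion` is introduced here for illustration.

```python
def free_so2_proportion(free_so2, total_so2):
    """Fraction of total sulfur dioxide present as free SO2 (both in mg/dm^3)."""
    if total_so2 <= 0:
        raise ValueError("total sulfur dioxide must be positive")
    return free_so2 / total_so2

# Hypothetical (free, total) pairs in mg/dm^3, chosen for illustration only.
samples = [(11.0, 34.0), (25.0, 67.0), (15.0, 54.0), (6.0, 7.0)]
proportions = [free_so2_proportion(f, t) for f, t in samples]
print([round(p, 3) for p in proportions])
```

Since free sulfur dioxide is part of the total, the ratio should always fall between 0 and 1, which makes it a handy sanity check on the two columns.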

It seems like the proportion varies widely, from 0.02 to 0.86, with most of the values being between 0.25 and 0.4. Now we can print some summary statistics for sulfur dioxide.

## [1] "Free SO2"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## [1] "Total SO2"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## [1] "Proportion Free"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02273 0.25930 0.37500 0.38230 0.48480 0.85710

We can see that the distributions match nicely with our observations from above. Now we will look into the sulphate contents of the wines.

From this histogram it seems that the sulphates follow a similar long tail distribution to many of the other salts contained in the wine. We will have to check later if the higher values have a larger impact on quality.

Bulk Property Analysis

Now, we will move on to the analysis of some bulk properties of the wine including sugar content, alcohol content, and density. First, we look at the distribution of residual sugar content. This is the amount of sugar remaining after the fermentation of the wine.

When looking at the sugar data, we see a pattern similar to that of many other variables. Most wines fall within a lower bell curve and there is a long tail of much sweeter wines. The sugar content of the final wine is usually set by the winemaker, so it is possible that most people are targeting a drier wine in that particular RS region. We can split the data across different quality bins to see if we can find a pattern.

There is no apparent trend between sugar content and quality in the different groups of wines. Next we look at the distribution of alcohol contents.

Unlike the sugar content and many other variables, the alcohol content does not follow the same long-tailed pattern: its distribution has many more values in the tail. This could be a result of the differing ripeness of grapes at harvest. If winemakers are targeting a particular residual sugar value, the alcohol content will be primarily constrained by the initial sugar content of the grapes. We can split the alcohol content along different quality groupings as well.

Split over the four quality bins, it is clear that higher alcohol seems to correlate with better quality. We will have to investigate this relationship further. The alcohol is not necessarily the cause of the improvement; it could be that riper grapes or some other variable improves quality and higher alcohol is a byproduct of that. The final variable for us to investigate is density.

The density follows a very normal distribution. It should be closely related to alcohol and sugar content.

Univariate Analysis

We are able to draw some useful information from looking at the single variable analysis. We can see that the alcohol content likely correlates with better quality in the wine.

What is the structure of your dataset?

The data contains 12 variables, 11 of which are floating point numbers and one of which is an integer. It also contains an id, called ‘X’, which has been converted to a factor to prevent numerical analysis. The data is already tidy and does not have missing values or many other issues.

What is/are the main feature(s) of interest in your dataset?

The dataset consists of one important metric, the quality of the wine, and 11 other chemical variables about the wine. The main feature of interest is how we can predict the quality of the wine from some combination of these other chemicals.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The dataset contains many interrelated variables. We will also look at how these variables vary together to supplement the quality-based analysis.

Did you create any new variables from existing variables in the dataset?

Yes, two new variables were created. The first, quality.bin groups the 3 and 4 values together, and the 7 and 8 values together. This is to increase the number of wines in the high quality and low quality bins for the purpose of data splitting. The second created variable is the proportion of SO2 that is free, which didn’t prove to be particularly helpful yet.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Many variables had long-tailed distributions, and that was the most unusual thing found so far. No further tidying or adjustment has been done on the data. Now, we will move on to analysis of two variables.

Bivariate Plots Section

First, we can take a look at a ggpairs plot to see if there are any unexpected relationships that we should investigate later. The plot shows correlations on the upper half, frequency polygons on the diagonal, and scatter plots on the lower half.

Looking at the ggpairs plot, we can see some expected relationships and some unexpected ones. Several variables have correlations with quality, although none of the correlations are large. The biggest effect is alcohol with a correlation of 0.48. The other larger correlations are volatile acidity (-0.39), sulphates (0.25) and citric acid (0.22). This makes sense from what we know about these acids: citric acid has a pleasant flavor but acetic (volatile) acid does not.

As far as other variables go, we can see that the different acid variables and pH are also correlated, although the volatile acidity correlation is the opposite of what we would expect since lower pH is more acidic. It also seems that acidity has a relationship with density, which I did not predict in the univariate analysis.

Let’s take a closer look at how some of these things relate to quality, starting with alcohol. Note that when plotting with quality as the dependent variable, adding jitter and some transparency is very helpful to visualize the data.
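
To show why jitter helps, here is a minimal Python sketch that adds a small uniform offset to discrete scores so that overlapping points spread out. The function name, the `width` of 0.3, and the sample scores are all illustrative assumptions, not the settings used for the plots in this report.

```python
import random

def jitter(values, width=0.3, seed=0):
    """Add uniform noise in [-width, width] to each value so overlapping points separate."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical quality scores; ratings in this dataset are integers from 3 to 8.
quality = [5, 5, 5, 6, 6, 7]
jittered = jitter(quality)
print([round(q, 2) for q in jittered])
```

Keeping the width below 0.5 ensures that jittered points from neighboring integer scores never overlap, so each point can still be traced back to its rating.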

Here we can see the relationship that was hinted at in the univariate analysis. Wines with higher alcohol definitely have a higher quality average. The effect is not strong, but there are not many better than average wines for us to draw conclusions from.

Taking a look at this from a different plot, we can see that the higher quality wines definitely have a higher value for alcohol. The box plot is nice because it shows us the median as well as some grouping statistics.

Looking at the citric acid, we can see that the correlation is small. It seems that there are more high quality wines with higher citric acid, and more low quality wines with low citric acid.

For volatile acid, the relationship is clearer. Having more volatile acid is detrimental to wine quality. Now let’s take a look at the sulphates.

I wasn’t expecting the sulphates to have a positive relationship. Sulphates can have a preservative effect, so perhaps they are preserving some of the fresh flavor of the wines. Let’s take a look at the sulfur dioxide.

This result is somewhat unexpected. We do expect sulfur dioxide to impact quality, since higher concentrations can give an off smell. The data documentation says that only free sulfur dioxide should impact the smell, yet total sulfur dioxide seems to have a much bigger impact on the quality rating of the wine. Let’s move on to confirm that some variables we don’t expect to matter indeed have no effect.

As we expected from the univariate analysis, sugar content doesn’t seem to have any impact at all on the quality of the wine. Now we can take a look at relationships between some other variables.

We can see that the density has a positive correlation with sugar, as we would expect. However, this doesn’t explain all the variation.

Taking a look at the effect that alcohol has on density, it’s clear that the lower density of ethanol is decreasing the density of many of the wines. It’s likely that alcohol and sugar explain much of the density variation for the wines. We can look at them all together in the multivariate analysis.
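
The direction of this effect can be illustrated with a rough ideal-mixing estimate in Python. This is a sketch under strong assumptions: the volumes of ethanol and water are treated as additive (real mixtures contract slightly), sugar, acids, and other dissolved solids are ignored, and the two density constants are approximate literature values near 20 C rather than anything taken from the dataset.

```python
ETHANOL_DENSITY = 0.789  # g/cm^3 near 20 C (approximate literature value)
WATER_DENSITY = 0.998    # g/cm^3 near 20 C (approximate literature value)

def ideal_mix_density(alcohol_percent_by_volume):
    """Density of an ethanol-water mix assuming volumes simply add (an idealization)."""
    v = alcohol_percent_by_volume / 100.0
    return v * ETHANOL_DENSITY + (1.0 - v) * WATER_DENSITY

# Alcohol levels spanning the range typical of the wines discussed here.
for abv in (9.0, 12.0, 14.0):
    print(abv, round(ideal_mix_density(abv), 4))
```

The estimate only shows the sign of the trend: more alcohol means lower density. Real wines are denser than this estimate because of sugar and other dissolved compounds.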

Taking a look at the acidity, we see that fixed acidity has a much larger effect on pH than volatile acidity. We actually see the opposite effect in volatile acidity, which is unexpected. Possibly the fixed acidity, which has a much larger magnitude, is affecting the pH more.

Taking a look at fixed acidity vs volatile acidity, we do see that wines with more volatile acidity tend to fall lower on the fixed acidity scale. This probably explains a large portion of the unexpected pH trend. Another thing we can take a look at is the relationship between sulphates and sulfur dioxide.

## 
##  Pearson's product-moment correlation
## 
## data:  wd$total.sulfur.dioxide and wd$sulphates
## t = 1.7178, df = 1597, p-value = 0.08602
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006087119  0.091774762
## sample estimates:
##        cor 
## 0.04294684
## 
##  Pearson's product-moment correlation
## 
## data:  wd$free.sulfur.dioxide and wd$sulphates
## t = 2.0671, df = 1597, p-value = 0.03888
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.002643125 0.100424406
## sample estimates:
##        cor 
## 0.05165757

Though the data description suggests that sulphates can have a large impact on the amount of sulfur dioxide, the data suggests there is not much correlation.
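
The t values in the output above follow directly from the correlation and the sample size. As a check, this Python sketch applies the standard formula t = r * sqrt(df / (1 - r^2)) with df = n - 2; the function name is introduced here for illustration.

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r from n pairs differs from zero."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r))

n = 1599  # number of wines, so df = 1597 as in the output above
t_total = t_statistic(0.04294684, n)  # total SO2 vs sulphates
t_free = t_statistic(0.05165757, n)   # free SO2 vs sulphates
print(round(t_total, 4), round(t_free, 4))
```

With this many samples, even a correlation near 0.05 can reach a p-value below 0.05, so statistical significance here says little about practical strength.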

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

There are several factors which impact the quality rating of the wine. Higher alcohol content was once again shown to be correlated with a higher quality of the wine. Some other variables were also associated with better quality, including citric acid and sulphate content. The sulphate content was surprising to me, but could be a result of it acting as a preservative.

We also saw that volatile acidity and total sulfur dioxide content had a negative correlation with quality rating in wine. The effect of total sulfur dioxide was bigger than the effect of free sulfur dioxide, which was unexpected.

Overall, alcohol is the strongest positive predictor of quality; however, there are not any very strong correlations for the quality rating.
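
One way to make "not very strong" concrete is to square each correlation, which gives the share of variance in quality that a simple linear fit on that single variable would explain. The Python sketch below uses the rounded correlations reported in the bivariate section; the dictionary names are introduced here for illustration.

```python
# Correlations with quality as reported in the bivariate section.
correlations = {
    "alcohol": 0.48,
    "volatile acidity": -0.39,
    "sulphates": 0.25,
    "citric acid": 0.22,
}

# r squared: share of variance explained by a one-variable linear fit.
r_squared = {name: r * r for name, r in correlations.items()}

for name, value in sorted(r_squared.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.2f}")
```

Even alcohol, the strongest single variable, accounts for less than a quarter of the variance in quality on its own.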

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I also looked into the relationship between some bulk properties of the wine, particularly the density. It seems like the alcohol content is a large predictor of lower density, which follows according to the lower density of ethanol. As a side note, as a chemical engineer I know that this relationship is not linear, though at these concentrations it shouldn’t be a big deal. The sugar content also has an impact on the density, though not as large as the alcohol content.

pH and the contribution of fixed and volatile acidity were also examined. There was one surprising result of volatile acidity, but it seems to be explained by the higher magnitude of the fixed acidity.

What was the strongest relationship you found?

The strongest correlation I found is the relationship between fixed acidity and pH, with a correlation magnitude of 0.68. Density and alcohol also have a very high correlation. The strongest correlation for the primary feature of the data is the alcohol content, with a correlation of 0.48.

Multivariate Plots Section

One thing I want to investigate here is the idea of balance in the flavor of wine. Balance is the amount of acidity which counters the sweetness of the wine, and it should be possible to look at the ratio of different types of acid and residual sugar to see if there is an impact on the perceived quality.

Plotting the sugar and different types of acidity on a scatter plot with different colors for quality bins, it doesn’t seem clear initially that there is a relationship. Let’s take a look at some box plots.

Looking at the box plots of fixed acidity vs quality, we see that there isn’t much of a relationship. This means that the balance of these wines probably isn’t very different for each quality. Let’s take a look at the volatile acidity next.

Looking at the sugar/volatile acidity, there definitely seems to be a relationship here. However, we know already that wines with lower volatile acidity tend to have higher quality. This could be the cause of the relationship we see here. We can calculate the correlation of these different variables with quality as a test.

## 
##  Pearson's product-moment correlation
## 
## data:  wd$volatile.acidity and wd$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  wd$residual.sugar/wd$volatile.acidity and wd$quality
## t = 8.2676, df = 1597, p-value = 2.843e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1551126 0.2491401
## sample estimates:
##       cor 
## 0.2025932

Here we see that the lower volatile acidity is more strongly correlated with higher quality than increased sugar/volatile acid ratio. We can’t say for sure which variable causes the effect, but it is possible that the balance is not the primary cause and we are only seeing the impact of high volatile acidity. In summary, it seems like we can’t find a correlation between the ‘balance’ of the wine and its quality rating that isn’t already explained by volatile acidity alone.
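
The two confidence intervals in the output above can be reproduced approximately with the Fisher z-transform, which is a common way to build an interval for a Pearson correlation. The Python sketch below is an illustration, with a function name introduced here; it uses the correlations and the sample size of 1599 from this report.

```python
import math
from statistics import NormalDist

def pearson_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

n = 1599
ci_volatile = pearson_confidence_interval(-0.3905578, n)  # volatile acidity vs quality
ci_ratio = pearson_confidence_interval(0.2025932, n)      # sugar/volatile acidity vs quality
print([round(x, 4) for x in ci_volatile])
print([round(x, 4) for x in ci_ratio])
```

In absolute value the two intervals do not overlap, which supports reading volatile acidity alone as the stronger of the two correlations.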

Next we will move on to analyzing the quality of the wine with respect to alcohol content, trying to see if any variables have a synergistic effect. First, we look at citric acid and alcohol with the points colored by quality.

Looking closely at this plot, we can see that many of the high quality wines seem to have both higher alcohol and higher citric acid. Also, many of the lower quality wines cluster in the bottom left, so having less of both quantities also isn’t beneficial for wine quality.

Taking a look at a slightly more complex version of the plot, we can see the variation for different bins of wine quality. It definitely seems like the higher citric acid is beneficial, as well as the higher alcohol.

We see a similar relationship when we look at sulphates, alcohol, and quality. This is not surprising since alcohol and sulphates are not correlated with each other, but both correlate with higher quality ratings.

Next we take a look at the total sulfur dioxide, alcohol, and quality. We expected to see a threshold above which the wine quality starts to suffer; however, this does not seem to be the case. Alcohol content is a much stronger predictor than sulfur dioxide content.

Though the data documentation suggests that wines with over 50 ppm free sulfur dioxide will suffer, that doesn’t seem to be the case here (although there aren’t many samples with free SO2 > 50 ppm).

Next, we will take a look at the density relationship again to see if we can explain it further. We had previously seen that there was a close relationship with alcohol content and density. We can add the acidity into that relationship.

Looking at this plot, the fixed acidity explains a large part of the additional variation in density!

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Most of the individual relationships related to the quality of the wine are fairly weak. Even when looking into relationships of acidity, sugar, and many other variables paired with alcohol, no significantly stronger correlations are found.

When combining different variables you can see regions of the graph for different quality bins; however, the variables do not seem to strengthen each other’s effect.

Were there any interesting or surprising interactions between features?

One relationship that was particularly strong relative to the others is the relationship between density, alcohol, and fixed acidity. This makes scientific sense because ethanol is less dense than water, but some acids are more dense. Looking at the three together, there is a clear trend, with alcohol decreasing the density and fixed acidity increasing it.


Final Plots and Summary

Plot One

Description One

The first very important plot, although it is simple, is the distribution of quality score (0-10) in the dataset. We can see that there are very few wines with quality scores of 3, 4, or 8, and far fewer 7s than 5s or 6s. This shows that the wines follow a somewhat normal distribution in quality; however, we have very little signal on what makes an excellent or extremely poor wine. This is important because it limits the future analysis that is possible: there is significant noise on most factors, and without many points in the high or low ratings it is very difficult to find strong correlations.

Plot Two

Description Two

There is one relationship related to the quality of wine that was moderately strong: higher alcohol wines tend to get higher quality scores. Looking at this plot it is clear that higher scoring wines also have higher percentages of alcohol. Lower scoring wines tend to be between 9 and 12 percent, while high scoring wines reach as high as 14 percent. Although this is the strongest relationship with quality in the dataset, it is only a moderate correlation with a Pearson r of about 0.48.

Plot Three

Description Three

The last plot shows something not related to quality, but how concentrations of fixed acids and alcohol impact the density. These are some of the strongest relationships in the dataset, showing how increasing alcohol decreases the density of the wine but increasing acidity increases the density. These correlations are clear from the scatter plot where the gradient color of the point corresponds to acidity.


Reflection

As a wine lover and data analyst, I was really hoping to find some very clear and interesting relationships in this dataset. However, the low number of high and low rated wines made it difficult to identify strong effects from a variety of factors. There are several things which do impact the wine quality, but only slightly.

With this in mind, the dataset is not extremely useful from the perspective of a chemical engineer trying to find compounds to increase or other tweaks to make to the wine’s contents (other than increasing the alcohol). It seems like it could be more suited for creating a model that can try to predict the quality of wine from a few chemical inputs. However, the relationships are weak, so the model would benefit from significantly more samples. Perhaps the white wine database, which has about 5000 samples, could be used to do a stronger analysis.

There were several surprising things in this dataset, the first of which was the general lack of strong correlations. This made it difficult for me to direct my exploration and left me often just checking lots of variables and not finding a strong trend. However, the data was numeric for lots of variables which allows for creation of many nice box and scatter plots. Overall, the exploration went well but could easily be improved in the future.